Cook, Paul, Jey Han Lau, Michael Rundell, Diana McCarthy and Timothy Baldwin (to appear) A lexicographic appraisal of an automatic approach for detecting new word senses. In Proceedings of eLex 2013, Tallinn, Estonia.

Authors

  • Paul Cook
  • Jey Han Lau
  • Michael Rundell
  • Diana McCarthy
  • Timothy Baldwin
Abstract

Over the last 20 or so years, lexicographical tasks such as finding collocations and selecting examples have been automated to some degree, both supplementing lexicographers' intuitions with empirical data, and reducing the "drudgery" of lexicography to allow lexicographers to focus on tasks which cannot easily be automated. Automated determination of word senses and identification of usages of a given sense, however, have proven difficult due to their covert nature. In this paper we present a method, based on an automatic word-sense induction system, for identifying novel word senses in a more recent Focus Corpus with respect to an older Reference Corpus. We evaluate this method in the context of updating a dictionary, and find that it could be a useful lexicographical tool for identifying new senses, and also dictionary entries whose definitions or examples should be updated.

1. Updating dictionaries

Lexicography is expensive. Despite the falling cost of corpus resources, the process of compiling and editing dictionary text remains labour-intensive. This applies not only to developing new resources from scratch, but also to the (more usual) job of updating existing dictionaries. One promising strategy for publishers is to automate some of the editorial tasks, and significant progress has been made in this area over the last ten years (Kilgarriff and Rychlý, 2010; Rundell and Kilgarriff, 2011; Rundell, 2012). Briefly, corpus-analysis software can aid in: (1) the determination of the syntactic, collocational, and text-type preferences of a given word or meaning; (2) the selection of a shortlist of suitable example sentences; and (3) (at a later stage) streamlining of the process of editing and finalizing dictionary text.

The current approach to dictionary development has the software presenting data to the lexicographer in a usefully predigested form. But recent advances offer the prospect of a model where "the software selects what it believes to be relevant data and actually populates the appropriate fields in the dictionary database" (Rundell and Kilgarriff, 2011, page 278), leaving the human expert to validate (or refine, or reject) decisions made by the computer. The various components of this model have all been trialled on real dictionary projects, providing the conditions for incremental improvements in performance. The GDEX software, for instance, which automatically finds appropriate dictionary examples in a corpus, was used initially on a project at Macmillan, when there was a requirement for a large number of new example sentences for specific collocational pairs (Kilgarriff et al., 2008). The results were uneven but broadly positive, with the editorial team completing the task more quickly than if they had taken a purely "manual" route. Versions of GDEX have since been used in other ventures. The heuristics and weightings have been optimised for a number of languages (e.g., Kosem et al., 2011), and the software is now a standard feature in the editorial toolkit of several dictionary developers.

In other areas, progress towards automation has been slower. But the direction of travel is clear: we are gradually putting together a suite of robust applications which, collectively, streamline the job of compiling and editing dictionary text. While the effect of all this is to transfer some lexicographic tasks from humans to machines, the goal is to produce better dictionaries at a lower cost.
A striking outcome of work done so far in this area is that automation not only delivers efficiency savings but also leads to improvements in quality. Automating a process forces us to go back to first principles and be explicit about what the task involves. What, for instance, are the features of a "good" dictionary example, and at what point can we say with confidence that a particular syntactic pattern is "typical" of a word? All of this is contributing to the goal of producing dictionaries that are more systematic, more internally consistent, and less reliant on the subjective judgment of individual lexicographers.

Improving the language-description process presupposes having some language that needs describing. Methodologies for extracting candidate headword lists from corpora are already well established. Meanwhile, the requirement for tracking language change (more pressing than ever now that most dictionaries are online and their users expect them to be up to date) is also being addressed, and the task of identifying emerging new words is benefitting from computational approaches (Rundell and Kilgarriff, 2011, pages 263–267). But (notwithstanding the media's obsession with shiny new headwords) there is more to updating a dictionary than adding neologisms. Two other salient aspects of keeping a dictionary up to date are finding novel senses of existing words, and ensuring that dictionary entries reflect contemporary conditions and technologies.

From the 1980s, as computer technology moved out of its specialist ghetto to become part of most people's everyday experience, words like mouse, icon, virus and window acquired new senses. (The word computer itself, for that matter, began life in the 17th century as a job title for someone whose work involved calculations.) Earlier dictionaries do not include these meanings, so they had to be added. More recent examples include words like cloud and tablet, hybrid (a type of car), sick (used in contemporary slang as a term of approval), and toxic (when referring to financial assets or debts). None of these meanings existed when the Macmillan Dictionary was first published (in print form) in 2002, and all have been added to the online edition (Macmillan English Dictionary Online, hereafter MEDO; http://www.macmillandictionary.com/).

An equally important, but more elusive, goal is to ensure that definitions and examples reflect contemporary realities. In recent updates to MEDO, for example, changes have been made to the definitions of meeting (participants don't have to be in the same location), marriage (not just between a man and a woman), and indeed dictionary (no longer simply "a book which ..."). MEDO has also targeted example sentences with dated contexts, like this one exemplifying one of the meanings of the verb slot:

(1) She slotted another tape into the cassette player.

Traditionally, these are labour-intensive operations. In an ideal world, a well-funded editorial team would carefully review every entry, consulting contemporary corpus data, and identify anything that needed changing or updating. This is increasingly impracticable. Budget constraints weigh heavily on most non-commercial institutions, while commercial lexicography is in the process of replacing a simple and reliable business model (selling books) with something more complex and (for the time being) less profitable.
So, for the sake of both systematicity and feasibility within limited budgets, it makes sense to see how far we can automate the tasks of finding novel senses and identifying other areas of the text that might need updating. In this paper we examine a previously proposed technique for automatically identifying word senses that are new to one corpus with respect to another (Lau et al., 2012), based on an automatic word-sense induction system. We propose a further extension to that system which can incorporate human intuitions about topics for which we expect to see many new word senses. We describe our previous evaluations of the core system, and its ability to identify new word senses. We then present a new evaluation of our proposed method in the context of updating a dictionary, in collaboration with a professional lexicographer (the third-named author of this paper). Our findings suggest that this method could indeed be a useful new addition to the lexicographer's toolkit.

2. Automatic novel sense detection

Word-sense induction (WSI) is the task of automatically grouping the usages of a given word in a corpus according to sense, such that all usages exhibiting a particular sense are in the same group, and each group includes usages corresponding to only one sense (Navigli, 2009). The category "word-sense" is of course not uncontroversial. There is no general agreement about what constitutes a discrete meaning of a word, and dictionaries often exhibit considerable variation in their treatment of the same polysemous word. But although word meanings are unstable entities, often with shifting boundaries, dictionary conventions traditionally require that lemmas are divided up into numbered senses, and a good lexicographers' style guide will provide criteria for doing this. (For a full discussion of word senses, see Hanks, 2013, pages 65–83.)

Here we describe a WSI technique we developed and its application to the task of identifying novel word senses. The WSI methodology we use is based on a model we previously proposed (Lau et al., 2012). The core machinery of this method is driven by probabilistic topic models (Latent Dirichlet Allocation, LDA; Blei et al., 2003), where latent or unseen topics are viewed as the driving force for generating the words in text documents. In this model a document is viewed as a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distributions for documents and topics are automatically "learned" from the corpus. Crucially, the "topics" in a topic model do not necessarily correspond to topics in the sense of the subject of a text: when topic models are applied to induce the word senses of a lemma of interest, these "topics" are interpreted as the induced senses.

In traditional topic models, the number of topics to be learnt is a parameter that must be set manually in advance. In WSI, this parameter translates to the number of senses to be induced for a lemma. To develop a model without this requirement, which can learn varying numbers of senses for different lemmas as appropriate, we used a Hierarchical Dirichlet Process (HDP; Teh et al., 2006), a variant of LDA that also learns an appropriate number of topics/senses.

Following our previous work, for each usage of a target lemma we extract a three-sentence context, where the second sentence contains the usage of the lemma, and the first and third sentences are the preceding and succeeding sentences, respectively. These three-sentence snippets are viewed as the "documents" in the topic model. We represent each document as the bag-of-words it contains, as is common for topic models. (We use the term bag-of-words to refer to the multiset of items occurring in some context, as it is commonly used in natural language processing. As described in Sections 3. and 4.1., we lemmatise our corpora; our "bag-of-words" representation is therefore in fact a bag-of-lemmas.) We also include additional positional word information to represent the local context of the target lemma. Specifically, we introduce an additional word feature for each of the three words to the left and right of the target lemma. An example of the features is given in Table 1.

Target lemma:             dog
Context sentence:         Most breeds of dogs are at most a few hundred years old
Bag-of-word features:     most, breed, of, be, at, most, a, few, hundred, year, old
Positional word features: most #−3, breed #−2, of #−1, be #+1, at #+2, most #+3

Table 1: An example of the topic model features.
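To make the document construction concrete, the following is a minimal sketch of how the three-sentence "documents", the bag-of-lemmas, and the positional features described above might be assembled, with an off-the-shelf HDP implementation (gensim's HdpModel) standing in for the authors' topic modeller. All function names are illustrative, and details such as how usages are located are our assumptions rather than the paper's actual code.

```python
# Sketch of the WSI pipeline described above, assuming lemmatised input.
# gensim's HdpModel is a stand-in for the authors' HDP topic modeller.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

def build_document(sentences, sent_idx, tok_idx):
    """One topic-model 'document' for a single usage of the target lemma.

    sentences: list of lemmatised sentences (each a list of lemmas);
    sent_idx, tok_idx: position of the target usage.
    """
    # Three-sentence context: preceding, containing, and following sentence.
    doc = []
    for i in (sent_idx - 1, sent_idx, sent_idx + 1):
        if 0 <= i < len(sentences):
            doc.extend(sentences[i])  # bag-of-lemmas features
    # Positional features for the three lemmas either side of the target,
    # e.g. 'breed#-2' (cf. Table 1).
    sent = sentences[sent_idx]
    for offset in (-3, -2, -1, 1, 2, 3):
        j = tok_idx + offset
        if 0 <= j < len(sent):
            doc.append(f"{sent[j]}#{offset:+d}")
    return doc

def induce_senses(usages):
    """usages: (sentences, sent_idx, tok_idx) triples for one target lemma."""
    docs = [build_document(*u) for u in usages]
    vocab = Dictionary(docs)
    bows = [vocab.doc2bow(d) for d in docs]
    hdp = HdpModel(corpus=bows, id2word=vocab)  # learns the number of senses
    # Label each usage with its highest-probability induced sense.
    labels = []
    for bow in bows:
        topics = hdp[bow]  # (sense_id, probability) pairs for this usage
        labels.append(max(topics, key=lambda t: t[1])[0] if topics else None)
    return hdp, labels
```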
To illustrate the senses induced by our model and the usages that correspond to the senses, we present Table 2 and Table 3 respectively, for the example lemma cheat.

Sense  Top-10 terms
1      cheat think want ... love feel tell guy include find
2      cheat student cheating test game school to teacher exam study
3      husband wife cheat wife #1 tiger husband #-1 on ... woman marriage
4      cheat woman relationship cheating partner reason man woman #-1 to spouse
5      cheat game play player cheating poker to card cheated money
6      cheat exchange china chinese foreign cheat #-2 cheat #2 china #-1 to team
7      tina bette kirk walk accuse mon pok symkyn nick star
8      fat jones ashley pen body taste weight expectation parent able
9      euro goal luck fair france irish single 2000 point complain

Table 2: The top-10 terms for each of the senses induced for the lemma cheat.

To identify novel senses, we compare a Focus Corpus with a Reference Corpus. In the application we consider here, updating a dictionary, the Focus Corpus would consist of newer texts; the Reference Corpus, on the other hand, would be older material, and common usages in this corpus would be expected to be reflected in the dictionary. (Details of the Reference and Focus Corpora used in this study are given in Section 4.1.) We combine the Focus and Reference Corpora to produce a supercorpus. For a given lemma of interest we then apply our WSI methodology to all of its usages in this supercorpus. (In this study we consider all lemmas meeting some frequency and keywordness cutoffs, also described in Section 4.1.) The WSI step automatically labels each usage of the lemma with its induced sense. We then calculate the "novelty" of an induced sense in the Focus Corpus as the ratio of its relative frequency in the Focus and Reference Corpora, akin to a simple approach to keywords (Kilgarriff, 2009), but applied to induced senses. We rank the lemmas according to the novelty of their highest-scoring induced sense. The highest-scoring induced sense for a given lemma is referred to as its novel sense.

New senses often arise for prominent cultural concepts (Ayto, 2006). In this paper we introduce a new variant of our method for identifying novel senses that incorporates this observation. We first manually form a list of terms related to a particular topic (computing and the internet, for the analysis presented in Sections 4. and 5.). For each induced sense we then determine its relevance to this topic based on its probability distribution over words from the topic modeller. We independently rank each induced sense by its relevance and by its novelty score, and then rank each induced sense by the sum of its ranks under these two rankings. This approach identifies induced senses which are both novel and related to a particular topic, and is referred to as "rank sum".
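The novelty and rank-sum scoring admits a compact sketch. In the code below, the sense labels are those assigned by the WSI step, and topic_terms stands for the manually compiled list of computing/internet terms; the add-one smoothing for senses unseen in the Reference Corpus is our assumption, since the paper does not say how zero counts are handled.

```python
# Sketch of the novelty score and "rank sum" combination described above.
from collections import Counter

def novelty_scores(focus_labels, reference_labels):
    """Ratio of each sense's relative frequency in Focus vs Reference."""
    f, r = Counter(focus_labels), Counter(reference_labels)
    nf, nr = len(focus_labels), len(reference_labels)
    # Add-one smoothing (our assumption) so senses absent from the
    # Reference Corpus do not cause division by zero.
    return {s: (f[s] / nf) / ((r[s] + 1) / (nr + 1)) for s in f}

def topic_relevance(sense_word_probs, topic_terms):
    """Probability mass a sense assigns to the hand-picked topic terms."""
    return sum(p for w, p in sense_word_probs if w in topic_terms)

def rank_sum(novelty, relevance):
    """Rank senses by summed rank under novelty and relevance (best first)."""
    def ranks(scores):  # rank 1 = highest score
        order = sorted(scores, key=scores.get, reverse=True)
        return {s: i + 1 for i, s in enumerate(order)}
    rn, rr = ranks(novelty), ranks(relevance)
    combined = {s: rn[s] + rr[s] for s in novelty}
    return sorted(combined, key=combined.get)
```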
3. Previous evaluation

In this section we describe previously presented evaluations of the WSI component of our method on several benchmark WSI tasks, and an evaluation of the accuracy of our method for detecting whether a given word exhibits a novel sense in a more recent Focus Corpus compared to an older Reference Corpus, and furthermore, whether it can detect specific instances of a novel sense within the Focus Corpus. In Sections 4. and 5. we present a new evaluation of our method for identifying novel senses in the context of updating a dictionary.

Our WSI technique was first presented in Lau et al. (2012), and was initially evaluated on two datasets (Agirre and Soroa, 2007; Manandhar et al., 2010) to compare the system to the state of the art in WSI. These datasets were produced under the auspices of a series of international events (SemEval, formerly SENSEVAL) for the objective comparison of computational systems that provide semantic analysis. Both datasets require systems to induce senses for a sample of lemmas from some training data and then label some unseen data with these senses. In this evaluation, our system outperformed the state-of-the-art systems, given the same conditions for tuning parameters. Moreover, on the more recent 2010 dataset, our model, which uses HDP to automatically learn the optimal number of topics (senses), outperformed a more basic LDA model even when the latter was manually told how many topics to learn.

More recently we evaluated our WSI technique by participating in two SemEval-2013 WSI tasks. "Word Sense Induction for Graded and Non-Graded Senses" (Jurgens and Klapaftis, 2013) was similar to the previous WSI evaluations considered, but additionally required systems to identify not just the single most appropriate induced sense for a given test usage, but all applicable senses and the extent to which each applies. In this evaluation a number of different metrics were considered, with our method outperforming all other participating systems in terms of one metric, and achieving strong results overall (Lau et al., 2013a). "Evaluating Word Sense Induction & Disambiguation within an End-User Application" (Navigli and Vannella, 2013) considered whether WSI can be applied to diversify search-engine results. In this task our system performed best of all participating systems, further demonstrating the effectiveness of our WSI approach (Lau et al., 2013b).

To evaluate the application of our WSI method to novel sense detection, our earlier work (Lau et al., 2012) provided the first, and to date only, available dataset, albeit a relatively small one. Producing such a dataset is difficult because word senses are covert, and manually labelling occurrences in a corpus is a very time-consuming and laborious process. We focused on a small sample of lemmas which were identified as having senses arising in the period between the early nineties and 2007. This period was selected simply because of the availability of a Reference Corpus, the British National Corpus (BNC; Burnard, 1995), and a more recent Focus Corpus, the ukWaC (Ferraresi et al., 2008), produced automatically from Web data in 2007. (Note that the new evaluation presented in this paper uses different Reference and Focus Corpora than our earlier work.) Since these corpora are of different sizes, they were made more comparable by using only the written portion of the BNC and extracting a similarly sized random sample of documents from the ukWaC; TreeTagger (Schmid, 1994) was used to tokenise and lemmatise both corpora. We used the Concise Oxford English Dictionary editions which best reflected contemporary usage for the two respective time periods: Thompson (1995, COD95) and Soanes and Stevenson (2008, COD08).
Working on the assumption that new senses often arise for culturally salient concepts (Ayto, 2006), we directed our search towards entries relevant to computing and with sufficient frequency (more than 1000 occurrences) in the BNC. The lexical selection was supported by a manual inspection of 100 random occurrences from the respective corpora, and also a manual inspection of the collocates of the candidate lexemes using word sketches (Kilgarriff and Tugwell, 2002; http://www.sketchengine.co.uk/). The above procedure yielded five genuine lemmas with a novel sense arising in the respective period: domain (n), export (v), mirror (n), worm (n), and poster (n). We then selected five distractor lemmas, each with the same part of speech as a target and similar frequency within the BNC, but for which there was no evidence of a new sense given the respective entries in COD95 and COD08.

The automatic WSI method was applied to the similarly sized sets of documents from the BNC and the ukWaC, and the output was used to rank the lexical items by their novelty score. The genuine lemmas received significantly higher ranks than the distractors, whereas a baseline which only considered the frequency difference across the two corpora did not produce a significant difference in ranking. We additionally used the manually tagged samples to demonstrate that the approach could not only successfully rank lemmas on the basis of novelty, but could also be used to identify the novel occurrences in the Focus Corpus. Promising results were obtained overall simply by identifying the specific novel sense with the topic that was automatically ranked highest for novelty, and using that to identify occurrences. Furthermore, because the induced senses are modelled as lists of salient words, topic models afford a readily interpretable representation of word sense, highlighting the potential for such automatic methods to produce output that can inform the lexicographic process.
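The paper does not name the significance test behind the rank comparison, so the following is only a hypothetical illustration of how such a comparison could be run, here with a Mann-Whitney U test over the two groups of novelty ranks; the rank values are invented for the example, not results from the paper.

```python
# Hypothetical comparison of genuine vs distractor lemma ranks; the test
# choice and the rank values below are illustrative assumptions.
from scipy.stats import mannwhitneyu

genuine_ranks = [1, 2, 3, 5, 6]       # invented example ranks (1 = most novel)
distractor_ranks = [4, 7, 8, 9, 10]
# "less": genuine lemmas tend to have smaller (i.e., better) rank values.
stat, p = mannwhitneyu(genuine_ranks, distractor_ranks, alternative="less")
print(f"U = {stat}, p = {p:.3f}")
```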
4. Lexicographical evaluation

In this section we describe an evaluation of our proposed method for identifying novel word senses in the context of updating a dictionary, based on manual analysis by a lexicographer.

4.1. Corpora and pre-processing

Our previous evaluation of the ability of our WSI method to identify novel senses (presented in Section 3.) used the BNC and ukWaC, corpora which consist of very different genres. For this analysis we consider more comparable corpora. We use the English Gigaword Fourth Edition (Parker et al., 2009), henceforth referred to as GIGAWORD, which consists of newswire articles from six services, including the New York Times Newswire Service, the Los Angeles Times/Washington Post Newswire Service, and the Agence France-Presse English Service, for the years 1994–2008. (There is a fifth edition of this corpus which additionally includes data for 2009 and 2010, but we unfortunately do not have a license for that edition.) For our Reference and Focus Corpora we use the sub-corpora of GIGAWORD for the years 1995 and 2008, respectively, the earliest and latest years in the corpus for which data from all services is available. This gives us Reference and Focus Corpora which are comparable, in that they both consist of newswire data from the same sources for a given year, although there are of course topical differences between the corpora for the two years. Moreover, these corpora are diverse, consisting of data from six sources, although all of the data is from newswires. GIGAWORD consists of several document types, with by far the most frequent being "story", which corresponds to a typical newswire story; we only consider these documents. GIGAWORD is known to contain a substantial number of ...
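For readers unfamiliar with the corpus format: GIGAWORD is distributed in LDC's SGML-style markup, in which each document opens with a line of the form <DOC id="..." type="...">. A minimal sketch of restricting one gzipped GIGAWORD file to "story"-type documents might look like the following; the file handling and regular expression reflect our assumptions about the distribution format rather than the paper's own pre-processing code.

```python
# Sketch: yield only type="story" documents from one gzipped GIGAWORD file.
import gzip
import re

DOC_OPEN = re.compile(r'<DOC id="([^"]+)" type="([^"]+)"')

def story_documents(path):
    """Yield (doc_id, raw_text) for each "story" document in the file."""
    doc_id, keep, lines = None, False, []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = DOC_OPEN.match(line)
            if m:  # a new document starts here
                doc_id, keep, lines = m.group(1), m.group(2) == "story", []
            elif line.startswith("</DOC>"):
                if keep:
                    yield doc_id, "".join(lines)
            elif keep:
                lines.append(line)
```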

Similar publications

Cook, Paul, Michael Rundell, Jey Han Lau and Timothy Baldwin (to appear) Applying a Word-sense Induction System to the Automatic Extraction of Diverse Dictionary Examples. In Proceedings of the XVI EURALEX International Congress (EURALEX 2014), Bolzano, Italy, pp. 15–19.

There have been many recent efforts to automate or semi-automate parts of the process of compiling a dictionary, including building headword lists and identifying collocations. The result of these efforts has been both to make lexicographers’ work more efficient, and to improve dictionaries by introducing more systematicity into the process of their construction. One task that has already been ...


Word Sense Induction for Novel Sense Detection

We apply topic modelling to automatically induce word senses of a target word, and demonstrate that our word sense induction method can be used to automatically detect words with emergent novel senses, as well as token occurrences of those senses. We start by exploring the utility of standard topic models for word sense induction (WSI), with a pre-determined number of topics (=senses). We next ...


Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models

Unsupervised word sense disambiguation (WSD) methods are an attractive approach to all-words WSD due to their non-reliance on expensive annotated data. Unsupervised estimates of sense frequency have been shown to be very useful for WSD due to the skewed nature of word sense distributions. This paper presents a fully unsupervised topic modelling-based approach to sense frequency estimation, whic...


Novel Word-sense Identification

Automatic lexical acquisition has been an active area of research in computational linguistics for over two decades, but the automatic identification of new word-senses has received attention only very recently. Previous work on this topic has been limited by the availability of appropriate evaluation resources. In this paper we present the largest corpus-based dataset of diachronic sense diffe...


unimelb: Topic Modelling-based Word Sense Induction for Web Snippet Clustering

This paper describes our system for Task 11 of SemEval-2013. In the task, participants are provided with a set of ambiguous search queries and the snippets returned by a search engine, and are asked to associate senses with the snippets. The snippets are then clustered using the sense assignments and systems are evaluated based on the quality of the snippet clusters. Our system adopts a preexis...


unimelb: Topic Modelling-based Word Sense Induction

This paper describes our system for shared task 13 “Word Sense Induction for Graded and Non-Graded Senses” of SemEval-2013. The task is on word sense induction (WSI), and builds on earlier SemEval WSI tasks in exploring the possibility of multiple senses being compatible to varying degrees with a single contextual instance: participants are asked to grade senses rather than selecting a single s...

